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Abstract 

Background: In order to tackle the important and challenging problem in proteomics of identifying known and 
new protein sequences using high-throughput methods, we propose a data-sharing platform that uses fully 
distributed P2P technologies to share specifications of peer-interaction protocols and service components. By using 
such a platform, information to be searched is no longer centralised in a few repositories but gathered from 
experiments in peer proteomics laboratories, which can subsequently be searched by fellow researchers. 

Methods: The system distributively runs a data-sharing protocol specified in the Lightweight Communication 
Calculus underlying the system through which researchers interact via message passing. For this, researchers 
interact with the system through particular components that link to database querying systems based on BLAST 
and/or OMSSA and GUI-based visualisation environments. We have tested the proposed platform with data drawn 
from preexisting MS/MS data reservoirs from the 2006 ABRF (Association of Biomolecular Resource Facilities) test 
sample, which was extensively tested during the ABRF Proteomics Standards Research Group 2006 worldwide 
survey. In particular we have taken the data available from a subset of proteomics laboratories of Spain's National 
Institute for Proteomics, ProteoRed, a network for the coordination, integration and development of the Spanish 
proteomics facilities. 

Results and Discussion: We performed queries against nine databases including seven ProteoRed proteomics 
laboratories, the NCBI Swiss-Prot database and the local database of the CSIC/UAB Proteomics Laboratory. A 
detailed analysis of the results indicated the presence of a protein that was supported by other NCBI matches and 
highly scored matches in several proteomics labs. The analysis clearly indicated that the protein was a relatively 
high concentrated contaminant that could be present in the ABRF sample. This fact is evident from the 
information that could be derived from the proposed P2P proteomics system, however it is not straightforward to 
arrive to the same conclusion by conventional means as it is difficult to discard organic contamination of samples. 
The actual presence of this contaminant was only stated after the ABRF study of all the identifications reported by 
the laboratories. 



Background 

Proteomics studies the quantitative changes occurring in 
a proteome and its appUcation for disease diagnostics 
and therapy, and drug development. It examines pro- 
teins at different levels, including their sequences, struc- 
tures and functionalities, and it is considered the next 
step in the study of biological systems, after genomics. It 
is much more complicated than genomics mostly 
because the proteome differs from cell to cell and 
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changes constantly through its biochemical interactions 
with the genome and the environment, while the gen- 
ome of an organism is rather constant. 

Proteins are large linear chains of amino-acids (resi- 
dues). The sequence of amino-acids in a protein is 
directly translated from the information encoded in the 
genome. However, a proteome is more complex than a 
genome. One organism has radically different protein 
expression in different parts of its body, different stages 
of its life cycle and different environmental conditions 
(e.g., in humans there are about 20,500 identified genes 
but an estimate of more than 500,000 proteins that are 
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derived from these genes [1]). This is mainly caused by 
mRNA alternative splicing processes and by the possibi- 
lity of residues in a protein being chemically altered in 
post-translational modification (PTM), either as part of 
the protein maturation processes before the protein 
takes part in the cell's functionalities, or as part of con- 
trol mechanisms. The discrepancy implies that protein 
diversity cannot be fully characterized by gene expres- 
sion analysis. Thus, proteomics is necessary for a better 
characterization of cells and tissues, and for manufactur- 
ing improved drugs and medicines. 

Protein Identification in Proteomics 

One important and challenging task in proteomics is the 
identification of proteins, that is, the recognition of the 
sequenced protein if the protein is known, or its discov- 
ery if it is unknown. For this, protein sequences are 
stored in public databases (such as nrNCBI, UniProt, or 
Genpept), However, they are mostly produced by the 
direct translation of gene sequences. This means that 
neither proteins with post-translation modifications 
(PTM) nor proteins whose genomes have not been 
sequenced would find exact matches in such databases. 

A key experimental technique for the identification of 
proteins is mass spectrometry (MS). Mass spectra pro- 
vide very detailed fingerprints of the proteins contained 
in a given sample. In the so called shotgun approach, 
MS is often combined with cutting-edge separation 
technologies to allow large-scale analysis of proteomes. 
For this, proteins are extracted from cells and tissues, 
enzymatically digested, and the resulting peptides 
(shorter amino-acids chains) separated by multidimen- 
sional liquid chromatography techniques. As the pep- 
tides are separated, they are on-line injected into the 
mass spectrometer, where they are ionized, fragmented 
and these fragments mass-monitored to produce a spe- 
cific sequence fingerprint. 

Identification of the huge amount of spectra produced 
by current state-of-the-art high-throughput analysis is 
one of the major tasks for proteomics laboratories. 
Mainly two popular bioinformatics techniques are 
involved in this effort. The first one takes advantage of 
public genome-translated databases (GTDB) that can be 
accessed through data-mining software (search engines), 
which directly relates mass spectra with database 
sequences. Most of these search engines {Mascot, X! 
Tandem, SEQUEST, OMSSA) are available both as 
stand-alone programs that consult a local copy of a 
GTDB, or as web-services connected to online GTDBs. 
The limitations, once again, lie in their capability of 
identifying missing PTMs or unsequenced genomes. 
The latter case is addressed applying de novo interpreta- 
tion algorithms that yield a sequence for a given mass 
spectrum, thus avoiding any database search. But these 



algorithms cannot become a solution of the problem 
because of intrinsic technical limitations. Once a protein 
has been sequenced de novo, one can look for similar 
proteins in a GTDB using a matching algorithm such as 
BLAST [2] or FASTA [3]; or, alternatively, one can use 
an algorithm such as OMSSA [4] to match spectra 
directly to sequences of a GTDB. 

Mass spectra identification is usually carried out by 
mixing and combining these two techniques. However, 
among other factors, the following issues complicate 
this task: the number of possible PTMs can multiply the 
amount of results to be analysed; bad quality and noise 
in mass spectra increase the uncertainty of interpreta- 
tion; and database errors in sequence annotations can 
lead to misunderstandings in the identification. Conse- 
quently, we get a huge amount of apparently useless 
data (for instance, non-matching mass spectra or low- 
scoring de novo interpreted sequences), which most of 
the times are simply discarded. As a result, this data is 
seldom accessible to other groups involved in the identi- 
fication of the same or homologous proteins. Our con- 
viction is that we can benefit from this kind of data 
making it available as searchable repositories for other 
laboratories. If we compared data coming from different 
laboratories then we would be able to eventually dis- 
cover new matches. The discovery of matches would 
contribute to further discriminate between really waste 
data and possibly good data. We envision many advan- 
tages with this new methodology, as other laboratories 
could provide the missing information for an incomplete 
spectrum or sequence, making a proteine identification 
process succeed; or even more, matches could help to 
recognize new proteins or identify PTMs. 

P2P Networks for Proteomics 

We propose a new scenario where the information to be 
searched is no longer centralised in a few repositories, 
but where information gathered from experiments in 
peer proteomics laboratories can be searched by fellow 
researchers. To avoid centralising all data into a single 
repository -with all the problems that such centralisa- 
tion would entail-, it is better to maintain the informa- 
tion locally at each of the proteomics laboratories. As a 
result, this decentralised data storage needs a decentra- 
Used search mechanism. The use of peer-to-peer (P2P) 
technologies fits our needs. 

A P2P network provides methods for accessing dis- 
tributed resources with minimal maintenance cost. It 
also provides scalable techniques to search through 
large amounts of resources scattered through the net- 
work. Furthermore, joining or leaving the network 
becomes a simple task. These properties of P2P net- 
works make the technology an ideal candidate to imple- 
ment a distributed search mechanism in a network of 
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proteomics labs. Other distributed storage systems such 
as distributed databases or federated storage services 
have been developed with efficiency in mind, and the 
maintenance and joining costs for these solutions are 
very high. 

A proteomics laboratory acting as a peer in a P2P net- 
work would be able to share its complete or partial data 
repository -e.g., mass spectra and de novo interpreted 
sequences- so that other peers can benefit from it. In 
addition, in order to find matches among data coming 
from different peers, the interacting peers of such a P2P 
network would need also to validate and cross check the 
consistency of the information obtained by fellow peers. 

In this article, we describe an approach that imple- 
ments such a P2P network on top of the OpenKnow- 
ledge (OK) system [5,6], which was developed in the 
scope of the European OpenKnowledge project [7]. 

The OpenKnowledge System 

The OpenKnowledge (OK) system is a fully distributed 
system that uses P2P technologies in order to share 
peer-interaction protocols and service components 
across the network. For this, a kernel module - the OK 
kernel- needs to be installed in each machine that is to 
be connected to the system. We shall call the protocols 
and service components to be shared generically Open- 
Knowledge Components (OKCs). Furthermore, these ser- 
vices are executed and coordinated using the same set 
of tools. In the Methods section below we will show 
how the tools of the OK system are used to implement 
the proteomics P2P application. The OK system consists 
of three main services which can be executed by any 
computer running the OK kernel: 

♦ a discovery service consisting of a distributed hash 
table (DHT), by which peer-interaction protocols and 
other OKCs are stored, so that they can be located and 
downloaded by users; 

♦ a coordination service, which manages the peer 
interactions between OKCs; and 

♦ an execution service, which is capable of executing 
the offered service by means of the OK kernel at the 
local machine. 

The workflow for implementing a new application on 
top of the OK platform is as follows. First, a specifica- 
tion defining the interaction protocol linking different 
services has to be defined. This specification is pub- 
lished to the discovery service so that other users can 
find it and can execute OKCs capable of playing the 
roles specified in the peer-interaction protocol. A devel- 
oper, not necessarily the one that specified the protocol 
originally, will develop the OKCs that are to play the 
roles defined in the protocol specification. Some of 
these OKCs may be shared across the network by pub- 
lishing them to the discovery service, so others can also 



execute them on their local machines. At this point the 
application is said to be implemented. 

After the application is implemented, it can be exe- 
cuted on top of the OK system. For this purpose, the 
users wanting to interact as specified in the given peer- 
interaction protocol by playing one of the roles will sub- 
scribe the appropriate OKCs to it. The discovery service 
is in charge of managing these subscriptions, and when 
it gathers enough of these to satisfy all the necessary 
roles in the protocol, it sends this information to a 
designated peer acting as the protocol coordinator who 
will start managing the peer interaction by asking each 
of the components to provide the services when 
required by the interaction protocol. 

The Lightweight Coordination Calculus 

For the case at hand, the developer has to specify a pro- 
tocol of the peer interaction defining the roles each per- 
ticipating peer has to play, the sort of messages sent 
amongst them, and the particular constraints to be 
solved by the OKCs enacting these roles. Several model- 
ling languages such as those reviewed in [8] could have 
been chosen. Our aim, however, is to use the most 
easily applied formal language for this engineering task 
that we could conceive and for which an executable 
peer-to-peer environment already exists, choosing thus 
the Lightweight Coordination Calculus (LCC) [9]. 

LCC is the executable interaction modelling language 
underlying the OK system. It is used to constrain inter- 
actions between distributed components and is neutral 
to the infrastructure used for message passing between 
components, although for the purposes of this paper we 
assume components are peers in some form of peer-to- 
peer network. 

For example. Figure 1 shows the specification in LCC 
of the protocol for sequenced MS spectra sharing that 
we will describe in detail later in the Methods section. It 
is based on a simple query- answering protocol between 
one inquirer and many repliers. 

An LCC specification describes (in the style of a pro- 
cess calculus) a protocol for interaction between peers 
in order to achieve a collaborative task. The nature of 
this task is described through definitions of roles, with 
each role being defined as a separate LCC clause. The 
set of these clauses forms the LCC interaction model. 
An interaction model provides a context for each mes- 
sage that is sent between peers by describing the current 
state of the interaction (not of the peer) at the time of 
message passing. Coordination is achieved between 
peers by communicating this state along with the appro- 
priate messages. Since roles are independently defined 
within an interaction model, it is possible to distribute 
the computation to peers performing roles indepen- 
dently, with synchronisation occurring only through 
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Figure 1 LCC specification of the protocol for sequenced MS spectra sharing. 



message passing. Should the application demand it, 
however, LCC can also be used in more centralised, ser- 
ver-based style. 

Figure 2 shows the main definitions of LCC's syntax. 
A detailed discussion of LCC, its semantics, and the 
mechanisms used to deploy it, lies outside the scope of 
this paper. For these, the reader is referred to [9]. In 
this paper, though, we explain enough of LCC to 
demonstrate how to represent interactions. 



Protocol 




{Clause, . ..} 


Clause 




Peer : : Dn 


Peer 




a.(Role.Id) 


Dn 




Peer \ Message \ Dn then Dn Dn or Dn Dn par Dn null <- C 


Message 




M => Peer | M => Peer <- C \ M <= Peer \ C <- M <= Peer 
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Term | C and C | C or C 


Role 




Term 
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Term, 



Where null denotes an event which does not involve message passing; Term is a structured term (e.g., a 
I'roiog term) and Id is either a variable or a unique identifier for a peer. 

Figure 2 Syntax of LCC. 
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An interaction model in LCC is a set of clauses, each 
of which defines how a role in the interaction must be 
performed. Roles are described in the head of each 
clause by the type of role (and its parameters) and an 
identifier for the individual peer undertaking that role. 
Clauses may require subroles to be undertaken as part 
of the completion of a role. The definition of perfor- 
mance of a role is constructed using combinations of 
the sequence operator ('then') or choice operator ('or') to 
connect messages and changes of role. Messages are 
terms, and are either outgoing to another peer in a 
given role (' => ') or incoming from another peer in a 
given role (' <= ')• Message input/output or change of 
role can be governed by a constraint to be solved before 
(when at the right of ' <-') or after (when at the left of ' 
<-') message passing or role change. Constraints are 
defined using the normal logical operators for conjunc- 
tion, disjunction, and negation. If they are subject to fail, 
the interaction may proceed along alternative paths (e.g., 
those specified with operator 'or'). Notice that there is 
no commitment to the system of logic through which 
constraints are solved -on the contrary we expect differ- 
ent peers to operate different constraint solvers. 

A protocol like the one in Figure 1 is generic in the 
sense that it gives different interactions depending on 
how the variables (starting with a capital letter) in the 
clauses are bound at run time -this depending on the 
choices made by peers when satisfying the constraints 
within these clauses. 

OpenKnowledge Components 

To complete the application, we need also an imple- 
mentation of the OKCs enacting each of the roles. For 
the protocol specified in Figure 1, this means two 
OKCs. One has to enact the researcher role as specified 
in the first two clauses, and another one has to enact 
the omicslab role as specified in the third clause. As a 
result each OKC will need to be able to solve the con- 
straints occurring in their respective role specification. 

For instance, for the omicslab role, the relevant OKC 
must be able to solve the constraint findHit(...), There- 
fore, its implementation must provide at least a findHit 
method. This method should search the local database 
for data that matches a given query. Obviously, this 
implementation will be tightly coupled to the local 
machinery, the file format used for storing this informa- 
tion, and the type of storage system from where it has 
to be retrieved. This is an obstacle for the portability of 
OKCs across different laboratories. Consequently, it is 
advisable that each laboratory develops its own particu- 
lar OKC for the omicslab role to be played, adjusted to 
its own system requirements. However, standard OKCs 
for the most common formats and mass spectrometers 
could be made publically availabe for dwonload and 



sharing. There is no restriction in the OK system to pre- 
vent locally produced OKCs from being pubUshed and 
downloaded by other users. 

Methods 

To show the viability of a P2P-based data-sharing envir- 
onment for the task of protein identification in proteo- 
mics we first specify in LCC a protocol for sharing 
sequenced MS spectra among peer laboratories. Then 
we describe the OpenKnowledge components (OKCs) 
that we have implemented to play the roles specified in 
the protocol, and finally we recount an actual experi- 
ment carried out with the OK system. The aim of the 
experiment is to serve as proof of concept for applying 
P2P technology to the task of protein and peptide iden- 
tification. As such, we do not claim that the experiment 
proves that the OK system for P2P-based data-sharing 
significantly improves all current standard protein and 
peptide identification protocols based on centralised 
database search. The data available for the experiment is 
insufficient in order to come to such conclusion. How- 
ever, we do show in the Results and Discussion section 
below that by using a P2P-based data-sharing environ- 
ment such as the one proposed in this article, research- 
ers gain valuable information that allows them to raise 
the confidence of their identification task. 

For an enhanced selection of those peer laboratories 
that are to participate in the data-sharing protocol, we 
have added a confidence evaluation mechanisms that 
varies over time, and which is based on the expected 
answer accuracy of peer laboratories. 

Protocol Specification 

Figure 1 shows an LCC specification of a protocol that 
guides peer laboratories in their search of each other's 
locally stored proteomic data files. This is only one of 
many possible protocols of this kind. LCC protocols are 
declarative specifications, and as such they are neutral 
to the specifics of a protocol execution. The only 
requirement is that all peers in the network that are to 
interact by means of a given interaction protocol should 
be capable of doing so. This capability amounts to (a) 
running a local copy of the OK kernel, and (b) having a 
local implementation of an OKC capable of resolving 
the constraints relevant to the role a peer is playing in 
the protocol. 

For our proof of concept we have specified a protocol 
-and implemented the required OKCs- for sharing 
sequenced MS spectra among peer laboratories. Ideally, 
to obtain the advantages of peer-based MS spectra shar- 
ing as outlined in the Background section above, we 
should ultimately aim at querying and sharing MS spec- 
tra directly. However, for our proof of concept for vali- 
dating the potential gain of peer-based proteomic data 
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sharing, we have first targeted the implementation of a 
system for sequenced MS spectra sharing. Since the pro- 
tocol models a simple query-answering interaction 
between one inquirer and many repliers, its application 
to MS spectra sharing will depend on the availability of 
OKCs that implement searching based on spectrum-to- 
spectrum matching. The OKCs we have implemented so 
far for our proof of concept, allow searching based on 
sequence-to-sequence matching (by means of BLAST) 
and spectra-to-sequence matching (by means of 
OMSSA). 

For the actual specification of the protocol, two main 
roles are needed, one for the inquirer, which in the pro- 
tocol specification has been termed researcher (the 
clause headed by a(researcher. Researcher)::), and 
another for the replier called omicslab, which will be 
replying to the queries (the clause headed by a(omicslab, 
OmicsLab)::). We will start explaining the latter role 
first, which is simpler. 

♦ omicslab: A peer in this role waits for a message with 
a query from a peer playing the researcher role, then 
solves this query by executing the findHit constraint 
that finds all matching hits in its local database, and 
finally sends these hits back to the researcher peer via 
another message. This is specified as a conditional 
message-passing action that is only carried out when 
the findHit constraint can be satisfied. 

♦ researcher: A peer in this role acts as the inquirer, 
and the role makes use of a researcher subrole that 
includes additional parameters. A peer in the main 
researcher role (the one without parameters) asks the 
user for a query. It does so by launching a input GUI 
with the constraint getQuery and then iterating 
through all the selected proteomics labs participating 
in the peer interaction (obtained via constraints getO- 
micsLabRole, getPeers and selectLabs), It aggregates 
all the different results and displays them to the user 
through an output GUI that is launched by solving 
the constraint showResults, Notice that all these con- 
straints are conditions of an empty message-passing 
action labelled null. (This is syntactical requirement 
of LCC: constraints always go together with a mes- 
sage passing action, which can be the empty one.) 
The iteration through all the omics laboratories is 
currently specified to be done via a recursive helper 
subrole, which receives the query and a list of omics 
laboratory. This role change is executed after getting 
all the identifiers of the peers playing the omicslab 
role by means of the getPeers constraint, which is 
executed by the protocol coordinator, who is the 
peer holding this information. In this subrole, the 
peer first sends a message containing the query to 
the first laboratory in the list, then aggregates the 



response it receives in the message from the labora- 
tory, and finally runs a recursive call to the rest of 
the list. When the list of laboratories is empty it 
returns an empty list of results. We could have 
decided alternatively to specify that queries ought to 
be sent out in parallel to all selected laboratories. 
This would be the obvious choice to speed up the 
querying process, but this is not relevant for the 
objectives of this article. 

Constraints of the researcher role that require user 
input -such as selecting candidate laboratories or writ- 
ing the query- and generate output to the user -such 
as displaying results- are done via so call visual con- 
straints. That is, these constraints are annotated in the 
LCC specification to be solved by means of domain-spe- 
cific GUIs. In our case here they are specially tailored 
for sequenced MS spectra sharing. 

OpenKnowledge Components (OKCs) 

In the following we describe the implementation of 
OKCs that ground the enactment of the protocol of Fig- 
ure 1. As mentioned above, our implementation so far 
allows for searching that is based on sequence-to- 
sequence matching (by means of BLAST) and spectra- 
to-sequence matching (by means of OMSSA). With this 
initial implementation we are capable to run our experi- 
ment that serves as proof of concept of the proposed 
P2P proteomics data sharing environment. 
The researcher OKC 

This OKC implements the constraints relevant for a 
peer that wants to participate in the peer interaction 
playing the researcher role. Hence, the OKCs main task 
is to ask the user for the proteomic query to solve, for- 
ward it to the laboratories, fetch the results, and present 
them back to the user. 

The getOmicsLabRole(RoleName) and getPeers(Role- 
Name, LabList) constraints are used to get the list of 
omics lab peers participating during a particular enact- 
ment of the protocol. The selectLabs(LabList, Selecte- 
dLabList) filters those laboratories to which users want 
to sent the proteomic query. It shows a GUI (Figure 3) 
by which users identify and select the desired peer 
laboratories. 

The getQuery(SearchType, SearchArguments, Input- 
Format, Input) constraint asks the user for the proteo- 
mic query to solve. It requires four arguments that need 
to be provided by the user: 

♦ SearchType: the type of search to be performed 
(BLAST or OMSSA). 

♦ SearchArguments: the parameters to be used by the 
laboratories when executing their locally installed 
search engines. 
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Figure 3 GUI for selecting peer laboratories for the data- 
sharing protocol. 



♦ InputFormat: the proteomic search engines 
(BLAST and OMSSA) allow different input formats, 
this argument is used to inform the search engines 
about the format used in the input. 

♦ Input: the proteomic sequences (if BLAST is used) 
or mass spectra (if OMSSA is used) that constitute 
the input to the search engines. 

To solve the getQuery(SearchType, Search Arguments, 
InputFormat, Input) constraint, a custom visualisation 
(Figures 4 and 5) is shown to the user. With these GUIs 
the user can easily build the proteomic query to be sub- 
mitted to the system by writing or selecting the argu- 
ments of the constraint. 

As soon as researchers have built their query, it can be 
sent to each one of the laboratories that are in the fil- 
tered laboratory list. This task is done by the researcher 
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Figure 4 GUI for building BLAST queries. 
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Figure 5 GUI for building OMSSA queries. 



(LabList, SearchType, SearchArguments, InputFormat, 
Input) subrole. This role iterates the list given to LabList 
using recursion; at each iteration the message query 
(SearchType, SearchArguments, InputFormat, Input) is 
sent to a laboratory of the list, a(omicslab, H), and the 
researcher peer waits for the laboratory peer s response 
message answer(Result, Resultlnfo) to aggregate it. 

When all the various results from the laboratories 
have been collected, the processResults(End) constraint 
is invoked. This constraint launches another custom 
visuaUsation GUI for human users (see Figures 6, 7, 8, 
and 9), by which they can examine the different results 
returned by the laboratories. 

From an architectural point of view, the researcher 
OKC has been divided into three main components: the 
OKC layer, the researcher kernel and the visualisation 
component. With this division, if the protocol specifica- 
tion or the visualisation requirements are modified, the 
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Figure 6 BLAST result window with answers from labs. 
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Figure 7 OMSSA result window with answers from labs. 



corresponding changes to the OKC can be applied 
quickly. The OKC layer acts as a thin interface between 
the OK P2P network and the researcher kernel^ translat- 
ing incoming constraints into researcher kernel method 
invocations. The researcher kernel contains all the utili- 
ties to build the proteomics queries and to parse the 
results provided by the laboratories, and it invokes the 
visualisations when needed. A schematic view of the 
architecture is shown in Figure 10. 
The omicslab OKC 

This OKC implements the only constraint relevant for a 
peer that participates in the peer interaction playing the 
omicslab role, namely findHit(SearchType, SearchArgu- 
ments, InputFormat, Input, Result, Resultlnfo). This 
constraint is to solve the query received from a peer in 
the researcher role and to return the result. The Search- 
Type, SearchArguments, InputFormat and Input argu- 
ments are supplied by the researcher peer, and they are 
used as input to solve the query. The Result and Resul- 
tlnfo arguments are instantiated by the omicslab peer. 
Result contains the result of the execution of the proteo- 
mic search engine and Resultlnfo contains additional 
metadata about the result. 
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Figure 8 OMSSA peptide detail view. 
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Figure 9 OMSSA mass spectrum view. 



From an architectural point of view, as with the 
researcher OKC, the omicslab OKC has been split into 
three main components: the OKC layer, the omicslab 
kernel and the search engine wrapper component. The 
OKC layer function is to act as interface between the 
OK P2P network and the omicslab kernel, mapping con- 
straints into omicslab kernel functions. The kernel task 
is to identify the incoming query and to send it to the 
search engines through the search engine wrapper com- 
ponent. This latter component executes the locally 
installed proteomic search engine and returns the result 
to the omicslab kernel, A schematic view of the architec- 
ture is shown in Figure 11. 

In order to connect a peer playing the omicslab role 
into our OK P2P network it is also necessary to install 
the BLAST and OMSSA proteomic search engines, to 
set up the peer's proteomic database, and to configure 
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Figure 10 Schematic view of the researcher OKC architecture. 



Schorlemmer et al. Automated Experimentation 2012, 4:1 
http://www.aejournal.net/content/4/1/1 



Page 9 of 1 7 



findHit 

incoming constraints 



OKC Layer 



kernel methods invocations 



OmicsLab Kernel 



Search Engine Wrapper 



Blast 
Search Engine 



OMSSA 
Search Engine 



V7 

J protein 



sequence 
database 



Figure 1 1 Schematic view of the omicslab OKC architecture. 



FASTA files to a binary BLAST formatted database 
(by means of the formatdb program provided in the 
BLAST package). Finally, to also pull existing online 
proteomic databases into our P2P network we also 
set up omicslab OKCs whose proteomic databases 
were downloaded from institutions such as NCBL 
♦ Configuration: To configure a peer to use the 
search engines, each machine acting as omicslab 
peer contains a configuration file. By reading this 
configuration file the omicslab peer knows where the 
search engines are locally installed, what database it 
should be using for each search, and the default 
parameters to use with the search engines. A frag- 
ment of a configuration file can be seen in Figure 12. 



Experimentation 

For the experimentation we have drawn from real data 
obtained from the ProteoRed scientific community. In 
the following we briefly describe this community, the 
test data employed, how the experimentation has been 
set up, and the concrete data-sharing peer interaction 
launched. 

The ProteoRed Scientific Community 

The National Institute for Proteomics, ProteoRed, is a 
network for the coordination, integration and develop- 
ment of the Spanish proteomics facilities providing 



the peer to use the search engines over its database. 

♦ Search engines: We decided not to include the 
search engines as part of the omicslab OKC, to 
make it platform- and search-engine-independent, as 
BLAST and OMSSA can be freely downloaded from 
NCBI, the National Center for Biotechnology Infor- 
mation (http://www.ncbi.nlm.nih.gov). It is required 
to install them locally in every machine acting as an 
omicslab, 

♦ Databases: For setting up the protein sequence 
database with the mass-spectra data returned by the 
lab's local mass spectrometers we processed each set 
of mgf files (a common format to collect mass-spec- 
tra) using the de novo interpreter tool PEAKS, which 
was available to all ProteoRed lab members (see 
Experimentation section below) to obtain a corre- 
sponding set of amino-acids sequences. Before build- 
ing the database of sequences we applied a filter 
over de novo results discarding short sequences (less 
than 4 bases) and duplicates. We assume the first 
(highest score) sequence to be the best de novo 
interpretation; after that, the de novo score is not 
taken into account anymore, although its value is 
annotated in the database as header information. 
The final step consists of formatting these plain text 
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Figure 12 Fragment of the configuration file used by the 
omicslab peers. 
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services to support Spanish researchers in the field of 
proteomics. As of 2009 ProteoRed integrated 19 well- 
established proteomics facilities giving services all over 
Spain and abroad. 

ProteoRed offers major services necessary in all stages 
of the protein analysis process, and its main objective is 
to increase the specialisation and competitiveness of 
proteomics facilities, considering the type of technolo- 
gies and equipment available, and the type of customers, 
their expertise and their geographical situation. Custo- 
mers are research groups from universities, the CSIC, 
hospitals, or other public institutions, as well as private 
companies (biotech and pharmaceutical companies). 
ProteoRed also has the objective of testing new techno- 
logical developments to provide new proteomics meth- 
odologies and equipment to the Spanish proteomics 
facilities. It also establishes open channels with custo- 
mers of these proteomics services to know their techno- 
logical needs, data accuracy, quality requirements, price 
scales, and new services needed for the future. Pro- 
teoRed also takes care of the coordination of courses, 
workshops and meetings to promote and enhance the 
quality of proteomics knowledge through the scientific 
community, ProteoRed technicians and governmental 
agencies. 
The Test Data 

For our test data we have decided to use preexisting 
MS/MS data repositories from the 2006 ABRF (Associa- 
tion of Biomolecular Resource Facilities) test sample. It 
consists of a mixture of 48 purified and recombinant 
proteins (plus an unknown number of protein contami- 
nants) extensively tested during the ABRF Proteomics 
Standards Research Group 2006 worldwide survey. 

Seventy-eight laboratories participated in the analysis 
of these mixtures, some of them members of the Pro- 
teoRed network. Among these, only 35% could correctly 
identify more than 40 protein components. Thus, the 
sample, being relatively handy for the purpose of testing 
the OK system, is still of enough complexity to become 
a challenge for most proteomics laboratories. 

This sample was prepared by combining five picomole 
aliquots of each protein. For this purpose, individual 
proteins were previously purified to assure a purity 
>95%, and the protein concentration determined by 
amino-acid analysis. The combined sample was lyophi- 
lized in 1 mL polypropylene tubes for storage before 
analysis. The presence of low levels of impurities in the 
mixture represented an additional challenge to this ana- 
lysis. Thus, in addition to the 48 standard proteins, 
most laboratories reported the identification of many 
other proteins. These identifications could be either due 
to real contaminants or to false positive identifications. 
To ascertain which was the case requires a careful ana- 
lysis of the full data obtained by a laboratory. 



This ambiguity could be rapidly solved querying the 
OK system and searching for other laboratories report- 
ing the same unexpected identifications as we show in 
the test experiment. This case is an example of a more 
general situation when a laboratory needs to evaluate 
the confidence of results that cannot be supported by 
other means -such as a high confidence match in a 
database- by checking if the same data has been 
obtained by a number of independent partners working 
with similar biological samples. 
Experiment Setup 

To test whether the OK system could be used by the 
ProteoRed community in order to speed up the protein 
detection process, we simulated an environment in 
which several peers of the OK system were emulating 
real proteomics laboratories. Through this environment 
a researcher could query these peer laboratories to 
retrieve data from their local databases. 

Since we did not have access to the entire MS/MS 
repository of the 2006 ABRF test sample, we set up a 
P2P ntework of those 9 laboratories of the ProteoRed 
community that made their data available for our 
experimentation. Each of these peers managed its own 
database containing protein data extracted from the 
ABRF test sample. To serve as an interface between the 
ABRF database and the OK system we implemented an 
omicslab OKC for each peer and subscribed it to play 
the omicslab role in charge of replying to incoming pro- 
teomic queries as specified in the MS spectra-sharing 
protocol of Figure 1. In addition, we gave proteomics 
researchers a tool that allowed them to search for pro- 
teomic information through the OK P2P network by 
sending queries to omicslab peers and retrieving data 
from their databases. For this, researchers had to set up 
their access to the OK system by executing the follow- 
ing set of steps: 

1. Installing the OK kernel. All researchers need to 
link into the OK system by installing the OK kernel 
[6] on a computer with an internet connection, in 
any operating system, with the only requirement 
that it has the Java 1.5 suite installed. 

2. Searching for the protocol specification. The OK 
system supports that different peers interact accord- 
ing to peer-interaction protocols. These protocols 
are to be specified and made public in the OK sys- 
tem. Users of the OK system can then search for 
protocol specifications that define the type of inter- 
action according to which they would like to interact 
with other peers. In our current scenario researchers 
will use the browsing facility provided by the user 
interface of the OK kernel application to search for 
the appropriate protocol specifications. Searching is 
achieved by sending queries with keywords to a 
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discovery service, which is in charge of storing all of 
the published OKCs and protocols. (The discovery 
service is itself not a centralised, but a completely 
distributed service that follows the decentralised 
approach taken by P2P networks.) The discovery ser- 
vice then retrieves all the protocol specifications 
whose metadata matches the query. The researchers 
can then read the descriptions associated to the 
retrieved protocol specifications in order to select 
the one that suits them best. If no specification suits 
them, they can refine the query by using different 
keywords. 

3. Installing the researcher OKC, Recall that the pro- 
tocol specified for our experiment defines two roles: 
the omicslab, which is in charge of replying to 
queries, and the researcher, which is the one sending 
out queries (see the Protocol Specification section 
above). Actual researchers that want to query peer 
laboratories will have to do so by enacting the 
researcher role as specified in the protocol. For this 
they must have previously installed an OKC capable 
of solving the constraints attached to that role (see 
the OpenKnowledge Components (OKCs) section 
above). OKCs can also be published in the OK sys- 
tem so that it is easy for users to find them and 
install them locally on their computers. Downloading 
and installing an OKC from the OK system can be 
achieved directly from the kernel's user interface. 
Once both the protocol specification and the role 
that a user wants to play have been chosen, one 
needs to search for existing OKC implementations 
for the given role, and then download and install 
them. At this point a researcher would be ready to 
start launching proteomic queries. Although this is 
the simplest way to install the required OKC, an 
advanced user may also find or develop an OKC 
through other means and plug it into the kernel. 

4. Subscribing to the researcher role. Once the actual 
researchers have installed the OKC needed for play- 
ing the researcher role, they can start the peer inter- 
action through a subscription. Users select the 
protocol specification and role they want to play and 
then run the subscription command from the user 
interface. This command sends the subscription 
information to the discovery service, which will 
define a peer (the researcher or another peer in the 
network) that will act as a coordinator of the peer 
interaction, following the protocol that will govern 
the peer interaction between the researcher and 
those peers that have subscribed to play the other 
roles defined in the protocol. In our case these will 
be the laboratory peers subscribed to play the omic- 
slab role. 



Protocol Enactment 

When the protocol and OKCs are in place, and suffi- 
cient peers have subscribed to the required roles of the 
protocol, the protocol itself can be enacted. 

1. Selecting the laboratories. When the interaction 
starts, the protocol iniciator peer recevies the list of 
all the peers that have subscribed to the omicslab 
role is received by the peer. This list is shown to the 
user which has to select the subset of the labs that 
he or she wants to query. 

2. Building the proteomic query. Having selected the 
laboratories to which the query will be sent, the user 
will then have to create the query. This is also done 
through a user interface providing users with a form 
through which they can specify the query. This form 
consists of: 

♦ an item asking to select the type of search 
(BLAST or OMSSA); 

♦ a text box in which to write the proteomic text 
query or import it from a file; 

♦ another text box where the researcher can add 
meta-data annotations that are used if the confi- 
dence on the returned results is to be deter- 
mined (see the next subsection); and 

♦ a subform where the user can enter custom 
search arguments to be used by the search 
engines. 

Once the query has been introduced by the researcher 
it is sent to all the selected omicslab peers so that they 
can process it and reply with the set of matching pro- 
teomic data. 

3. Showing the results from laboratories. Every time 
an omicslab peer replies to the query, the researcher 
OKC stores the results. Once all the omicslab peers 
to which the query was sent have replied, the results 
are shown to the users via a custom visualisation. 
Through this custom GUI researchers can browse 
through the different results and compare them. 
This is the final step of the protocol execution, if the 
researchers want to make another query they simply 
start another interaction. 



Confidence in Peers 

In our scenario, researchers send queries to and receive 
answers from a set of peer proteomics laboratories. For 
these researchers it is important to have a mechanism 
helping them to distinguish which laboratories return 
more significant and relevant answers to their queries. 
This can be achieved by measuring the confidence that 
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researchers have in each laboratory, a confidence that is 
built during successive queries launched by the 
researcher to their peer laboratories. 

In [10], the trust on a peer (a proteomics lab in our 
scenario) is defined as the overall satisfaction with 
previous experiences with that peer. In our case the 
satisfaction measure of each particular experience with 
a laboratory is based on the similarity between the 
query and the answer obtained from the laboratory. 
This similarity, however, is not only the similarity 
between amino-acid sequences - it also takes into 
account additional information such as the enzyme 
chosen for digestion during sample preparation, the 
type of mass spectrometer used, or the kind of organ- 
ism that the sample was taken from, among other 
information. 

The similarity between a query and a laboratory hit 
(the basic building block of the confidence calculation 
in this application domain) is then defined as the pro- 
duct of two factors: 

spectrum similarity = spectrum significance x protocol similarity 

where the value of spectrum significance represents 
how significant the matched spectrum is with respect to 
the query spectrum, and the value of protocol similarity 
represents how similar the spectrographic protocols are 
that where followed by the researcher when obtaining 
the spectrum in the query and the laboratory when 
obtaining the spectra in its database. Let us explain 
these two similarity measures in more detail. 
Spectrum significance 

To calculate the significance of a spectrum (or of its 
associated sequence of amino-acid characters) search 
engines that work over databases of amino-acid 
sequences usually report a score S together with a prob- 
abilistic value, referred to as P-value. The score 5' is a 
measure of the similarity of the query to the sequence 
matched, and the P-value is a measure of the reliability 
of this score. It is the probability due to chance that 
there is at least another match with score greater than 
or equal to S, (Here chance means the comparison of (i) 
real but non-homologous sequences; (ii) real sequences 
that are shuffled to preserve compositional properties; 
or (iii) sequences that are generated randomly based 
upon a DNA or protein sequence model [11].) But 
instead of determining P-values directly, search engines 
such as BLAST or OMSSA report the so-called £-values, 
which are easier to interpret. The £-value is the number 
of times a sequence with a score greater than S may 
occur by chance in the database. That is, the closer the 
£-value to zero, the better the match. But as £-values 
are not normalised in 0[1], we take the P-value for 



computing spectrum significance, which can be derived 
from the reported £-value as follows: P = 1 - e'^. 

The second factor that we consider is the number of 
significant spectra in the set M of matched sequences 
reported by a peer laboratory with respect to the score 
distribution of the sequences of M, Thus, given a query 
sequence, the overall significance of the set of matched 
sequences provided by a peer laboratory is given by 



N 



where N = {m g M \ > Gm }i being Gm 



the standard deviation of scores in the population M, 
Protocol Similarity 

In its graphical representation a mass spectrum is a 
plot of the mass to charge values of the detected ions 
versus their corresponding intensity. Fragmentation 
spectra from peptides show profiles that are character- 
istic of the peptide sequence and that contain different 
types of ions [12]. In addition to the returned spectra, 
researchers need to have some confidence that the way 
in which the spectrum in the query has been obtained 
is comparable to the way in which the hits are 
obtained. This is because, although the protocol fol- 
lowed by a laboratory may be well defined, the proto- 
col itself admits certain variations that will produce 
spectra with different ion types. (Bear in mind that this 
is not the peer-interaction protocol specified in LCC 
for data sharing in our P2P network.) These variations 
include the enzymes used to modify and to digest the 
amino-acids, and the type of mass spectrometer used 
to produce the spectra. 

Another important factor is the organism from which 
the protein has been obtained. All this information is 
provided as metadata. We define the protocol similarity 
of a hit between a query and a database entry as 



protocol similarity -■ 



organism similarity x modification similarity x digestion similarity 
X mass - spectrometer similarity 



Organism similarity The semantic similarity among 
organisms Oi and 02 used for our confidence evaluation 
is based on the organism taxonomy tree as according to 
the NCBI lineage. Figure 13 shows a fragment of this 
taxonomy. 

To defined a similarity measure we have used the one 
described in [13] and repeated here: 



Sim{oi,02) = 



1 



-Kil . ^ 



if Oi = O2 

Otherwise 



where / is the length (i.e., number of edges) of the 
shortest path between nodes, h is the depth of the dee- 
pest node subsuming both nodes, and ^1 and ^2 are 
parameters balancing the contribution of shortest path 
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Archaea Eukaryota Bacteria Viruses 



Alveolata Metazoa Fungi Viridiplantae 

Chordala 

I 

bony vertebrates 

I 

lobe-finned fish and 
tetrapod clade 

1 

Mammalia 

I 

Primates 

I 

Homo sapiens 

Figure 13 Fragment of the organism lineage tree. 



length and depth respectively. For example, the path to 
humans is: 

cellular organisms; Eukaryota; Fungi/Metazoa group; 
Metazoa; Eumetazoa; Bilateria; Coelomata; Deuteros- 
tomia; Chordata; Craniata; Vertebrata; Gnathosto- 
mata; Teleostomi; Eu-teleostomi; Sarcopterygii; 
Tetrapoda; Amniota; Mammalia; Theria; Eutheria; 
Euarchontoglires; Primates; Haplorrhini; Simiiformes; 
Catarrhini; Hominoidea; Hominidae; Homo/Pan/ 
Gorilla group; Homo; Homo sapiens 

and the paths to the rat, the sole, and the rhesus 
macaque are, respectively: 

♦ cellular organisms; Eukaryota; Fungi/Metazoa 
group; Metazoa; Eumetazoa; Bilateria; Coelomata; 
Deuterostomia; Chordata; Craniata; Vertebrata; 
Gnathostomata; Teleostomi; Euteleostomi; Sarcop- 
terygii; Tetrapoda; Amniota; Mammalia; Theria; 
Eutheria; Euarchontoglires; Glires; Rodentia; Sciurog- 
nathi; Muroidea; Muridae; Murinae; Rattus 

♦ cellular organisms; Eukaryota; Fungi/Metazoa 
group; Metazoa; Eumetazoa; Bilateria; Coelomata; 
Deuterostomia; Chordata; Craniata; Vertebrata; 
Gnathostomata; Teleostomi; Euteleostomi; Actinop- 
terygii; Actinopteri; Neopterygii; Teleostei; Elopoce- 
phala; Clupeocephala; Euteleostei; Neognathi; 
Neoteleostei; Eurypterygii; Ctenosquamata; Acantho- 
morpha; Euacanthomorpha; Holacanthopterygii; 
Acanthopterygii; Euacanthopterygii; Percomorpha; 
Pleuronectiformes; Soleoidei; Soleidae; Solea; Solea 
senegalensis 

♦ cellular organisms; Eukaryota; Fungi/Metazoa 
group; Metazoa; Eumetazoa; Bilateria; Coelomata; 



Deuterostomia; Chordata; Craniata; Vertebrata; 
Gnathostomata; Teleostomi; Euteleostomi; Sarcop- 
terygii; Tetrapoda; Amniota; Mammalia; Theria; 
Eutheria; Euarchontoglires; Primates; Haplorrhini; 
Simiiformes; Catarrhini; Cercopithecoidea; Cerco- 
pithecidae; Cercopithecinae; Macaca; Macaca 
mulatta 

Taking as parameter values Ki = 0.02 and K2 = 0.6 we 
get the following similarities: 

Sim(homo sapiens; solea) = 0.46 
Sim (homo sapiens; rattus) =0.72 
Szm (homo sapiens; macaca mulatta) =0.81 

The NCBI database of organisms and taxonomies is 
dynamic and very large (currently there are more that 
300,000 organisms), but it provides a REST web service 
to get taxonomic information without the need to 
download the entire database. (REST is a method to 
make web service queries where the query is written as 
an URL over HTTP.) 

Modification similarity To calculate the similarity of 
modification terms, rather than a tree distance we have 
used a binary similarity table (see Table 1). This is 
because there are situations that cannot be compared. 

For example, a peptide modified with oxidation of an 
amino-acid cannot be compared with another peptide 
that has not been modified at all. 

Digestion similarity To calculate the similarity of the 
digestion terms we assign a similarity value of 1 between 
identical enzymes, and also between Trypsin and LysC, 
but only if the peptide ends with K. In all other cases 
we assign a similarity value of 0. 

Mass-spectrometer similarity Finally, the similarity of 
the mass spectrometers is calculated based on the tax- 
onomy tree depicted in Figure 14 (see Table 2 for acro- 
nyms). The tree classifies the spectrometers according 



Table 1 Similarity table for peptide modification 



modification code 


-1 1 


2 


3 


5 


31 


32 


89 90 


-1 
1 

2 


1 

1 1 
1 0 














3 


1 0 




1 










5 


1 0 




1 


1 








31 


1 0 




1 


1 


1 






32 


1 0 




1 


1 


1 


1 




89 


1 1 


0 


0 


0 


0 


0 


1 


90 


1 1 


0 


0 


0 


0 


0 


0 1 



-1 = nothing; 1 = oxidation of M; 2 = carboxymethyl C; 3 = carbanidomethyl 
C; 5 = propionamide C; 31 = carbamylation of K; 32 = carbamylation of n- 
trem peptide; 89 = oxidation of H; 90 = oxidation of W. For the remaining 
codes the similarity is 1 if they are equal and 0 if the are not equal. 
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Spectrometer 



Tandem Spectrometer 




Type I 



Type II 




51-QTOF 

ESI-TOF/TOF 



Type III Type IV 

/\ /\ 

MALDI-LIT MALDI-TOF/TOF ^TD MS ECD MS 



ESI LIT-FT (HOD) 



Figure 14 Similarity tree for different mass spectrometers. 



to the type of ion fragmentation profiles that are typi- 
cally produced. The main classification parameters are 
the ionization method (MALDI or ESI ionization), 
which determines the type of precursor ions formed, 
and the collision energy (high- or low-energy collision). 

Results and Discussion 

To play the researcher role we randomly selected 38 
peptide sequences obtained by PEAKS de novo analysis 
of the mass spectrometric analytical data obtained by 
the LP CSIC/UAB proteomics laboratory from the 
ABRF sample. Additionally we included data derived 
from spectra that during the original analysis matched 
to proteins not included in the standard ABRF list. 

The mass spectrometric data set from LP CSIC/UAB 
was obtained by LC-MS/MS analysis of the tryptic 
digest of the protein mixture in the ABRF sample. This 
data set included 2000 spectra from which 48 of the 49 
proteins in the ABRF standard could be identified by 
conventional proteomic data analysis. Each protein was 
identified from the sequence of one or more of its tryp- 
tic peptides. Queries were performed against 9 data- 
bases, including 7 proteomics labs, the NCBI Swiss- Prot 
database, and the database of the researcher itself. Mak- 
ing the researcher laboratory play both roles in the peer 



Table 2 Mass spectrometer acronyms 

acronym 



mass spectometer 



ESI electrospray 

MALDI matrix assisted laser desorption 

QIT quadrupole ion trap 

QqLIT hybrid quadrupole-linear ion trap 

TSQ triple stage quadrupole 

QTOF hybrid quadrupole-time of flight 

TOP time of flight 

LIT-FT(HCD) hybrid LIT-Orbitrap with high collision dissociation 

ETD fragmentation by electron-transfer dissociation 

ECD fragmetnation by electron-capture dissociation 



interaction {researcher and omicslab) served as a true 
positive control to check for failures in the functionality 
our system. 

To evaluate the evolution of the confidence of 
reported answers to queries, the 48 sequences were 
divided into 4 groups to be able to mimic the building 
of a query history. Despite that this model does not pro- 
duce a valid history (as sequences were grouped ran- 
domly), it will allow to evaluate the functionality of the 
confidence calculation. 

Each group was queried with the same parameters 
(Figure 15) and the results analysed in the researcher 
OKC prospector window (Figure 16). As expected, the 
search in the researchers database (column labelled with 
uab) generated always full coincidences. Contrarily, 
other proteomics labs and the NCBI Swiss-Prot database 
(labelled with ncbi) produced more diverse results. Most 
of the queries produced high percentage identity values 
in the ncbi search. These hits give direct information 
about the identity of the peptide and the source protein 
('id' and'des' text windows in Figure 17). One of the 
queries in Figure 17 (Query 10) produced a 100% coin- 
cidence in the NCBI Swiss-Prot database. The expecta- 
tion values for this match indicated that it was not due 
to hazard. The protein that had been tentatively identi- 
fied, P20160 (azurocidin precursor) was, however, not 
included in the list of component of the standard ABRF 
sample. 

The analysis of the answers from other proteomics 
labs for this sequence showed that other laboratories 
found identical (see, i.e., cpif) or highly homologous 
sequences (see, i.e., ucm). This fact indicated that several 
laboratories had observed the presence of the same 



[ Clear 



>W|./uab AeRF060735.RAW. 
YGDOLVLRR 

>td| ./uab_ABRF060735,RAW. 
CCRRWLFASERR 
>»d| ./uab_A8RF060735. RAW . 
MRDFFYVRWRCHAHFVRDRK 
>id|./uab ABRF060735.RAW. 
YNGFFAGRLYR 

>ld| ./uab_A6RFCI60735.RAW. 
HGGVFSARK 

>»d| ./uab_A8RF060735.RAW. 
NPWHTAGCTYSPESGSGGVHK 
>«| ./uab_ABRF060735.RAW. 
TMGAAAAAPPPR 



,arc_Spectra\553. 17816_2.ann 0.5871377 VMCBITAXID-9606 / 
,anz_spectra\565.865M_3.ann 0. 1 1253427 \NCBITAXID-9606 \M00RES-(1 
,anz_spectra\933.'1032_3.ann 0.08788036 \NCBITAXID-9606 \MOOR£S-(1 1 
,arB_spectra\682.7426_2.ann 0.30140257 \NCBITAXID-9606 
,anz_spectra\479.22S13_2.af¥i 0. 15925884 \NCBITAXID-9606 
,arc_spectra\747.5762_3.ann 0.040508583 \NC8ITAXID-9606 \|^K)OftES-<? 
ar«_spertra\556.l30$_2.ann 0.04538134 \NCBITAXID-9606 



Search Type 

□ localsearch [7] ok P2P Network search □ local NCBI database ssarCh [7| NcBI searchj 



NCBI database: Swissprot protdn sequences (swtssprot) 



Blast General Parameters 


Blast Scoring Parameters 


Max target sequerxes: 100 vj 
Automatically adjust parameters for short Input sequence: 


Matrix: PAM70 v 
Gap Costs: Existence 1 1 Extertsion 1 v j 


Expect threshold: { 100 


Compositional adjustments: No adjustmei* ^ 



Figure 15 Query window and BLAST search parameters used 
for this study. The sequences shown in the image correspond to 
the first group of queries. 
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Figure 16 OK-omics Prospector window. Responses to the first 
group of queries (% coincidences). 



component in their samples and supported the fact that 
the queried sequence was not the result of noise or lab- 
specific sample preparation artifacts. More detailed ana- 
lysis of the results indicated that the presence of protein 
P20160 was supported by other NCBI matches (see, i.e., 
Query 11 in Figure 18) and that the corresponding 
query also had produced highly scored matches in many 
proteomics labs. This analysis clearly showed that 
P20160 was a relatively high concentrated contaminant 




Mt|otal NCBI svnssprot 
M |6746tsp|P201 60|CAP7_HUMAN 



furocHtm precursor (C atonic ar 



|GPorrTR 

[GPOrflR 
[cPOfFTR 



|GPDFFTR 

kd|.«iab_AeRf060727.RAW.an7_spectra«}9.477_1 



Figure 17 Match data from ncbi database for query 10. 
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^|tran*mula(lon ono-COPaASLONKCR 



Figure 18 Match data from ncbi database for query 11. 



(several laboratories detected several of its tryptic pep- 
tides) that could have been present in the ABRF sample. 

This fact is evident from the information that can be 
derived from the OK system. However, it is not straight- 
forward to arrive to the same conclusion by conven- 
tional means, as it is difficult to discard organic 
contamination of the samples. The actual presence of 
this artifact was only stated after the ABRF study of all 
the identification reported by the laboratories. 

The confidence on the result of each proteomics lab 
was evaluated for the 4 queries performed (Figure 19). 
Confidence values for the different laboratories are in 



Trust evolution in time 





\ •-■ rnb cpif ptb 



■icm uco ■ ■ upv vh -O-lc'Cai NCBI swlssF«ot I 



Figure 19 Confidence evaluation. 
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the range from near 0 to near 1 indicating different effi- 
ciencies sending back high score matches for the quer- 
ied sequences. No improvements on the quaUty of the 
information derived from the OK-omics system were 
observed by selecting 2-3 of the more trusted labs for 
these queries. Due to the small size of the databases, an 
important fraction of the processing time was due to 
the public NCBI database search. Selecting a few labora- 
tories of high trust could however increase the perfor- 
mance when a higher number of peers are involved in 
the interaction. As expected by the origin of the data 
(sequences randomly taken from the LP CSIC/UAB data 
set) trust values are stable over the experiments. 

Conclusions 

We have presented a new form of data sharing for 
expression proteomics with the aim of (1) augmenting 
significantly the percentage of peptides and proteins 
to be sequenced and identified by means of mass- 
spectrometry-based analysis, and (2) reducing signifi- 
cantly the sequencing and identification time needed. 
For this we have combined current bioinformatics 
techniques for proteomics with a novel multiagent 
system architecture and a distributed knowledge coor- 
dination mechanism in peer-to-peer networks, which 
have been developed in the context of the OpenKnow- 
ledge EU project. 

In this article we have specified the data-sharing peer- 
interaction protocol for P2P proteomics, implemented 
the P2P data-sharing system using the OpenKnowledge 
system, and carried out a feasibility experiment with test 
data from preexisting MS/MS data repositories from the 
2006 ABRF test sample provided by different labora- 
tories for the ProteoRed scientific community. 

We conclude that by using the proposed P2P data- 
sharing system and protocol a researcher is capable of 
deriving information from the test data that is not 
straightforward to obtain by conventional means. This 
in turn shows that P2P data-sharing in proteomics can 
indeed lead to enhanced protein identification. 
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